[SWDEV-497665] Blocked cudaMemcpyAsync race condition by synchronizing #1447

corey-derochie-amd commented Dec 4, 2024

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: SWDEV-497665, SWDEV-501207, SWDEV-507270

What were the changes?
Replaced some `cudaMemcpyAsync` calls with the synchronous `cudaMemcpy`.

Why were the changes made?
In anticipation of behavioural changes being introduced to `hipMemcpyAsync` in ROCm 6.3.

How was the outcome achieved?
Replaced `cudaMemcpyAsync` calls with `cudaMemcpy` in `ncclTransportP2pSetup` to avoid a race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.
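
For readers without the diff open, the shape of the change can be sketched roughly as follows. This is not the code from this PR: `publishConnectInfo` and `roundTripMatches` are hypothetical helpers, the host-to-device direction is an assumption, and the host stand-ins exist only so the sketch builds without a CUDA toolchain.

```cpp
// Sketch only: publishConnectInfo and roundTripMatches are hypothetical helpers,
// not RCCL functions, and the host-to-device direction is an assumption.
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

#if defined(__CUDACC__)
#include <cuda_runtime.h>
#else
// Host-only stand-ins so the sketch also builds without a CUDA toolchain.
// They mimic the shape of the runtime calls used below and nothing more.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;
enum cudaMemcpyKind { cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost };
inline cudaError_t cudaMalloc(void** p, std::size_t n) {
  *p = std::malloc(n);
  return *p ? cudaSuccess : 1;
}
inline cudaError_t cudaFree(void* p) {
  std::free(p);
  return cudaSuccess;
}
inline cudaError_t cudaMemcpy(void* dst, const void* src, std::size_t n,
                              cudaMemcpyKind) {
  std::memcpy(dst, src, n);
  return cudaSuccess;
}
#endif

// Copy setup data into a buffer that a later setup step will consume.
cudaError_t publishConnectInfo(void* devDst, const void* hostSrc,
                               std::size_t bytes) {
  assert(devDst != nullptr && hostSrc != nullptr);
  // Asynchronous form being replaced (needs a stream, and the call may
  // return before the copy has completed):
  //   cudaMemcpyAsync(devDst, hostSrc, bytes, cudaMemcpyHostToDevice, stream);
  // Blocking form: the host does not move on to the next setup step
  // until cudaMemcpy returns.
  return cudaMemcpy(devDst, hostSrc, bytes, cudaMemcpyHostToDevice);
}

// Publish a host vector, read it back, and report whether the data survived.
bool roundTripMatches(const std::vector<int>& host) {
  if (host.empty()) return true;
  const std::size_t bytes = host.size() * sizeof(int);
  void* dev = nullptr;
  if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
  std::vector<int> back(host.size(), 0);
  const bool ok =
      publishConnectInfo(dev, host.data(), bytes) == cudaSuccess &&
      cudaMemcpy(back.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
      back == host;
  cudaFree(dev);
  return ok;
}
```

The blocking call trades some overlap during setup for a simpler ordering guarantee: nothing later in the setup path starts on the host while the copy call is still outstanding.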

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

corey-derochie-amd self-assigned this Dec 4, 2024
…ortP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.
corey-derochie-amd merged commit c158d3a into ROCm:develop Jan 3, 2025
34 checks passed
corey-derochie-amd deleted the SWDEV-497665_fix-async-memcpy-crash branch January 3, 2025 20:07